[WIP] Update CLDR data to 44.1.0 and add hour/duration support #1343
Open
serhii73 wants to merge 3 commits into
- Upgrade CLDR source data from version 31.0.1 to 44.1.0: 288 new locales
  added, all existing locale JSONs updated in
  dateparser_data/cldr_language_data/date_translation_data/
- Add hour/duration unit to relative-type data for all locales
- Regenerate all dateparser/data/date_translation_data/*.py from the new
  source data, keeping master's possessive-quantifier optimization in
  relative-type-regex patterns
- Modernize dateparser_scripts/ to use pathlib.Path and unified cldr-json
  repository layout (utils.py, write_complete_data.py, get_cldr_data.py,
  order_languages.py)
- Restore Ukrainian words removed by the CLDR upgrade (uk.yaml)
- Update docs/supported_locales.rst and languages_info.py
- Update 34 test inputs in test_languages.py to reflect CLDR 44.1.0
  abbreviation/name changes (bs-Latn, ce, kl, qu, so, sr, sw, zu, am, as,
  brx, hy, ig, kok, mr, nn, de, eu, gu, mk, chr, bs-Cyrl)
- Add /cldr-json/ to .gitignore
Finishes PR #1216 (Gallaecio:cldr-update).
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
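The pathlib modernization mentioned in this commit message could look roughly like the following. This is a minimal sketch, not the PR's actual code: the helper names `iter_locale_files` and `load_locale_json`, the file name `dateFields.json`, and the one-directory-per-locale layout are all assumptions made for illustration.

```python
import json
from pathlib import Path


def iter_locale_files(main_dir: Path, filename: str = "dateFields.json"):
    """Yield (locale, path) pairs for every locale directory under main_dir.

    Assumes one sub-directory per locale, each holding `filename`; this
    layout is illustrative and not taken from the PR.
    """
    for locale_dir in sorted(p for p in main_dir.iterdir() if p.is_dir()):
        candidate = locale_dir / filename
        if candidate.is_file():
            yield locale_dir.name, candidate


def load_locale_json(path: Path) -> dict:
    """Read one locale file as UTF-8 JSON."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Compared with string-based `os.path.join` calls, `Path` objects keep directory traversal and file reading in one type, which is presumably the motivation for the change.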
Closed
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #1343      +/-   ##
==========================================
- Coverage   97.11%   92.53%   -4.59%
==========================================
  Files         235      379     +144
  Lines        2909     3053     +144
==========================================
  Hits         2825     2825
- Misses         84      228     +144
- en.yaml: restore 'in {0} weeks time' and 'in {0} weeks\' time' patterns
for 'in \1 week' that CLDR 44 dropped; regenerate en.py
- af.yaml: restore 'sek' (second abbreviation) dropped by CLDR 44; regenerate af.py
- en-US: CLDR 44 has no en-US.json (US settings merged into base en);
add minimal en-US.json, regenerate en-US.py, add en-US back to
languages_info.py locale list for 'en'
- tests/test_freshness_date_parser.py: update Cherokee (chr) test input
from uppercase to lowercase Cherokee encoding used by CLDR 44 patterns
- tests/test_search.py: update Danish detection test to text that includes
'tirsdag' and 'januar' (distinctly Danish vs Swedish), since CLDR 44
sv.py changes caused the old text to be ambiguous
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
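The en.yaml entries restored in this commit pair a canonical form ('in \1 week') with surface patterns that contain a {0} placeholder. A rough sketch of how such a mapping can be applied, assuming the placeholder expands to a numeric capture group; the function names and the number regex here are hypothetical and are not dateparser's internals:

```python
import re
from typing import List

NUMBER = r"(\d+[.,]?\d*)"  # assumed expansion of the {0} placeholder


def compile_pattern(surface: str) -> "re.Pattern":
    """Turn a '{0}'-style surface pattern into a compiled regex."""
    escaped = re.escape(surface).replace(re.escape("{0}"), NUMBER)
    return re.compile(escaped, re.IGNORECASE)


def normalize(text: str, canonical: str, surfaces: List[str]) -> str:
    """Rewrite the first matching surface form into the canonical form."""
    # Try longer patterns first so the more specific variant wins.
    for surface in sorted(surfaces, key=len, reverse=True):
        new_text, count = compile_pattern(surface).subn(canonical, text)
        if count:
            return new_text
    return text


surfaces = ["in {0} weeks time", "in {0} weeks' time"]
print(normalize("in 2 weeks' time", r"in \1 week", surfaces))  # in 2 week
```

Both restored variants collapse to the same canonical string, so later parsing stages only need to understand 'in N week'.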
The en-US locale is loaded via en.py's locale_specific["en-US"] section,
not as a standalone language file. Adding it to en.yaml ensures the
locale_specific entry gets the correct name ("en-US") and date_order
(MDY). Remove the standalone en-US.json that was added by mistake.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
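To illustrate the point about locale_specific sections, here is a small sketch of how a per-locale entry might override base-language settings. The dictionary layout reuses the key names mentioned in this commit message (name, date_order, locale_specific), but the data values and the `resolve_locale` helper are hypothetical; the generated en.py is far larger and is loaded by dateparser's own machinery.

```python
from copy import deepcopy

# Illustrative data only; not the contents of the real en.py.
EN_INFO = {
    "name": "en",
    "date_order": "DMY",  # placeholder base value, not necessarily dateparser's
    "locale_specific": {
        "en-US": {"name": "en-US", "date_order": "MDY"},
    },
}


def resolve_locale(info: dict, locale: str) -> dict:
    """Return base language data with one locale's overrides applied."""
    base = {k: v for k, v in info.items() if k != "locale_specific"}
    resolved = deepcopy(base)
    overrides = info.get("locale_specific", {}).get(locale, {})
    resolved.update(deepcopy(overrides))
    return resolved
```

Under this model a standalone en-US file is redundant: the locale only needs the handful of keys that differ from the base language.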
Comment on lines -499 to +504
- "en-US",
  "en-VC",
  "en-VG",
  "en-VI",
  "en-VU",
  "en-WS",
+ "en-US",
Contributor
This looks weirdly unnecessary.
- json_dict = {}
+ json_dict = OrderedDict()
@@ -61,7 +61,7 @@ et
  eu
  ewo
  fa 'fa-AF'
- ff 'ff-CM', 'ff-GN', 'ff-MR'
+ ff
Contributor
What does this mean in user terms? Would a user who was passing one of these locales somewhere now get an error? If so, could we handle this differently and keep supporting these locales with the same data as ff (if that was not already the case)?
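One way to read this suggestion is a subtag fallback: when a regional locale such as ff-CM has no entry of its own, resolve it to the base language instead of raising an error. A minimal sketch under that assumption; `resolve_with_fallback` is a hypothetical helper, not part of dateparser:

```python
from typing import Optional, Set


def resolve_with_fallback(locale: str, available: Set[str]) -> Optional[str]:
    """Return the most specific available locale for `locale`, or None.

    Strips trailing subtags one at a time, e.g. "ff-CM" -> "ff".
    """
    parts = locale.split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in available:
            return candidate
        parts.pop()
    return None


available = {"ff", "en", "en-US"}
print(resolve_with_fallback("ff-CM", available))  # ff
```

With a fallback like this, callers that pass ff-CM, ff-GN, or ff-MR would keep working and simply receive the ff data.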
Contributor
Are the changes in test expectations necessary? (here and in the other test files)
Summary
- Add hour/duration unit to relative-type for all locales (e.g. "2 hours ago" now parses correctly in more languages)
- Regenerate all dateparser/data/date_translation_data/*.py files from the new source data
- Modernize dateparser_scripts/ to use pathlib.Path and the unified cldr-json repository layout
- Restore Ukrainian words removed by the CLDR upgrade (uk.yaml)
- Update docs/supported_locales.rst and languages_info.py
- Add /cldr-json/ to .gitignore
Finishes #1216 (by @Gallaecio). The original PR was a draft with merge conflicts; this re-implements the same work cleanly on top of current master, preserving the possessive-quantifier optimization from #1335.
Test plan
- python -m pytest tests/test_languages.py
- dateparser.parse("2 hours ago") and dateparser.parse("vor 2 Stunden") both return valid dates
🤖 Generated with Claude Code
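The manual check in the test plan could be scripted along these lines. This is only a sketch: `failed_phrases` is a hypothetical helper that accepts any parse callable, so it can be exercised without dateparser installed. It relies on the fact that dateparser.parse returns None when it cannot parse a string.

```python
from datetime import datetime
from typing import Callable, Iterable, List, Optional


def failed_phrases(
    parse: Callable[[str], Optional[datetime]], phrases: Iterable[str]
) -> List[str]:
    """Return the phrases for which `parse` did not produce a datetime."""
    return [p for p in phrases if not isinstance(parse(p), datetime)]
```

With dateparser installed, `failed_phrases(dateparser.parse, ["2 hours ago", "vor 2 Stunden"])` should return an empty list if the claim in the test plan holds.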